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Abstract 


Background: Parrots belong to a group of behaviorally advanced vertebrates and have an advanced ability of 
vocal learning relative to other vocal-learning birds. They can imitate human speech, synchronize their body 
movements to a rhythmic beat, and understand complex concepts of referential meaning to sounds. However, 
little is known about the genetics of these traits. Elucidating the genetic bases would require whole genome 
sequencing and a robust assembly of a parrot genome. 


Findings: We present a genomic resource for the budgerigar, an Australian Parakeet (Melopsittacus undulatus) — the 
most widely studied parrot species in neuroscience and behavior. We present genomic sequence data that includes 
over 300X raw read coverage from multiple sequencing technologies and chromosome optical maps from a 
single male animal. The reads and optical maps were used to create three hybrid assemblies representing 

some of the largest genomic scaffolds to date for a bird; two of which were annotated based on similarities 





to reference sets of non-redundant human, zebra finch and chicken proteins, and budgerigar transcriptome 
sequence assemblies. The sequence reads for this project were in part generated and used for both the 
Assemblathon 2 competition and the first de novo assembly of a giga-scale vertebrate genome utilizing PacBio 
single-molecule sequencing. 


Conclusions: Across several quality metrics, these budgerigar assemblies are comparable to or better than the 
chicken and zebra finch genome assemblies built from traditional Sanger sequencing reads, and are sufficient to 
analyze regions that are difficult to sequence and assemble, including those not yet assembled in prior bird 
genomes, and promoter regions of genes differentially regulated in vocal learning brain regions. This work 
provides valuable data and material for genome technology development and for investigating the genomics of 
complex behavioral traits. 


Keywords: Melopsittacus undulatus, Budgerigar, Parakeet, Next-generation sequencing, Hybrid assemblies, 
Optical maps, Vocal learning 
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Data description 

Raw genome DNA sequence reads 

DNA samples were obtained from a blood sample taken 
from a single male Melopsittacus undulatus, who we aptly 
named Mr. B. For Illumina sequencing, reads were gener- 
ated at Duke University (16x), Illumina UK (54x), and BGI 
(219x) using Ilumina’s TruSeq [1] version2 or version3 
chemistries (Table 1 and GigaDB [2]). The version3 chem- 
istry reads through GC-rich regions, which are often found 
in promoters, more evenly than does version2 [3]. The in- 
sert sizes for the BGI libraries ranged from 220 bp to 
40 Kbp, and the insert sizes for the Duke libraries ranged 
from 400-600 bp, in order to assist assemblies. Fragment 
sizes for the mate pair libraries, based on genome mapping, 
and the per base sequence quality distribution for the li- 
braries are shown in GigaDB [2]. The Duke University Illu- 
mina libraries were sequenced at two different cluster 
densities: 8x coverage reads at the normal 420 k clusters/ 
mm density and 8x coverage at a lower 350 k clusters/mm. 
The lower cluster density was used to increase the number 
of GC-rich regions sequenced. For PacBio sequencing, 6.76 
Gbp (~5.5x coverage) of PacBio RS reads [4] were gener- 
ated at Pacific Biosciences from two insert size libraries 
(7.5 K bp at 1.93x and 13 Kbp at 3.56x; PacBio reads 
error-corrected with Illumina can be downloaded from the 
supplementary webpage associated with [5]). With all reads 
combined, the total coverage exceeds 300x (assuming a 
haploid genome size of 1.23 Gbp) (Table 1), perhaps mak- 
ing Mr. B one of the most sequenced individual vertebrate 
animals as of to date. The read length distributions of these 
different types of reads are shown in Figure 1. 


Fosmid Library 

To validate the assemblies in the Assemblathon 2 com- 
petition, a fosmid library was created from sheared gen- 
omic DNA (35-40 Kbp) of Mr. B [6]. Ten pools of 
clones were generated and sequenced using Illumina as 
described in [7]. Each pool of reads was individually as- 
sembled using Velvet [8]. The fosmid assemblies have 
been deposited at GigaDB [2]). 


Transcriptome Reads 
454 FLX transcriptome reads were generated from 
brain RNA isolated from two males, neither of whom 


Table 1 Summary of genomic reads 
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was Mr. B. An initial set of sequencing runs of both males 
was conducted at Washington University at St. Louis, pro- 
ducing 89.2 Mb of transcriptome sequence as reported in 
[9] (NCBI accession numbers SRRO29329-—30) and were 
assembled using Newbler [10] into 19,198 contigs. An 
additional 21x coverage (run label GKOK2XFO1) was gen- 
erated at Duke University from one of the males. 


Assemblies 

We present three hybrid assemblies: 1) Budgerigar 
454-illumina hybrid v6.3 using the CABOG assem- 
bler; 2) Budgerigar PBcR hybrid using the CABOG as- 
sembler; and 3) Budgerigar illumina-454 hybrid using 
the SOAPdenovo2 assembler. The first two assemblies 
were annotated, after which, optical-map assisted mega- 
scaffolds were constructed based on them. As of yet, the 
SOAPdenovo2 assemblies have not been annotated or 
aligned to optical maps. The quality statistics of these as- 
semblies are in listed in Table 2, and brief descriptions of 
their construction and relative quality are provided in 
Additional file 1. 


Validating sequence assemblies with optical maps 

Optical Mapping is a single molecule system for the 
construction of ordered restriction maps of whole ge- 
nomes [11], and it has been used to guide and validate 
sequence assemblies [12]. An optical map for the bud- 
gerirgar genome was created, using a method described 
in Additional file 1. The optical map contigs ranged in 
size from 2 Mbp to 74 Mbp and spanned over 900 Mbp 
with a resolution of 13.94 Kbp (i.e, one non-redundant 
Swal every 13.94 Kbp). The contigs were then aligned to 
in silico restriction maps generated from Budgerigar_v6.3 
and PBcR assembly scaffolds in order to validate the scaf- 
folds. An approximate 859.21 Mb of the optical maps 
aligned to the Budgerigar_v6.3 assembly, in 146 scaffolds 
with 3 or more Swal restriction fragments (excluding ends 
and fragments less than 0.4 Kbp). Of these 146 scaffolds, 
43 appeared chimeric (ie., aligned to two or more optical 
map contigs). For the PBcR assembly, 796.63 Mbp optical 
map contigs aligned, in 673 scaffolds. Of the 673 scaffolds, 
only 51 were chimeric. Thus, while the Budgerigar_v6.3 
assembly has a higher N50 scaffold metric and hence lon- 
ger scaffolds compared to the PBcR assembly, 30% the 








Library sizes Total reads Total BP (Mb) Coverage (assuming 1.23 Gbp 
genome size) 
454 Shotgun, 3 kb, 8 kb, 20 kb mate pair 41,898,557 19,736 15.4x 
Illumina 220, 230, 500, 400-600, 800, 2 kb, 561,074,047 356,597 289x 
5 kb, 10 kb, 20 kb, 40 kb paired end 
Pacific Biosciences 7.5Kb, 13 kb 4176,242 6,763 5.5% 
Combined 607,148,846 383,096 309.9x 





Zz0z Aew vz uo ysanB Aq | 16Z89Z/1b-€-XLZL-Z-202/L/¢/eRWe/esueIsebi6/wWoo dnos1wapese//:sdyy Woy pepeojumog 


Ganapathy et al. GigaScience 2014, 3:11 
http://www.gigasciencejournal.com/content/3/1/11 








20Kbp 

8Kbp 

3Kbp 

FLX Titanium 

FLX Titanium XL+ 
PacBio pre-release C2 
Illumina BGI 

Illumina UK 

Illumina Duk e 










Ses ati Rage ies 





























Reads per Bucket (Scaled by Maximum Bucket) 











T T T T T T T T 
50 100 200 500 1000 2000 5000 10000 


Read Length (5bp Buckets) Log—Scaled 


Figure 1 The distribution of read lengths in 454, Illumina and 
PacBio budgerigar sequences. The reads are binned into 5 bp 
buckets based on their lengths, and the fraction of reads 
(normalized by the size of the largest bucket) falling into each 
bucket is shown. Thus, curves shifted towards the right indicate 
longer read lengths. The reads labeled “20 Kbp”, “8 Kbp” and “3 Kbp”, 
“FLX Titanium” and “FLX Titanium XL+” are 454 reads. The reads 
labeled “PacBio pre-release C2” are uncorrected PacBio reads. The 
lumina read lengths appear as colored square boxes, since these 
read lengths are uniform. The “Illumina Duke” reads are of length 76, 
The “Illumina UK” reads are of length 101, and the “Illumina BGI" 
reads are of lengths 90 or 150. The longest reads come from 

PacBio sequencing, followed by 454 FLX + (i.e., FLX Titanium 

XL+) sequencing. 











v6.3 scaffolds are chimeric, whereas only 7.6% of the PBcR 
assembly are chimeric. 


Optical map assisted assemblies 

We took both Budgerigar_v6.3 and PBcR assemblies and 
filtered out alignments that did not extend to the end of 
either the genomic sequence scaffold or the optical map. 
The remaining high-quality alignments were then used 
to identify optical map alignments that bridged scaffolds, 
such that a single optical map aligned to the ends of at 
least two sequence scaffolds. We then iteratively ex- 
tended the megascaffolds beyond pairs of sequence scaf- 
folds, using three heuristics: (1) we limited the overhangs 
(ie., the portion of the scaffold sequence that does not 
align to the optical map) to 2 Mbp total; (2) we bridged 
two scaffolds together only if the size of the gap separating 
them is less than 2 Mbp of Ns; and (3) if a sequence scaf- 
fold aligned to more than one optical map, we placed it 
into the largest one it aligns with. The above procedure 
slightly reduced the number of scaffolds from 25,212 to 
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25,163 in the Budgerigar_v6.3 assembly, and from 54,668 
to 54,138 in the PBcR assembly. This relatively small 
change in number is expected as our procedure tended to 
join only sequence scaffolds that were already fairly large 
into even larger megascaffolds, since it is only possible to 
confidently align an optical map to a fairly large sequence 
scaffold bearing numerous Swal restriction sites. However, 
this analysis substantially improved the scaffold N50 sizes 
from 10.6 Mbp to 13.8 Mbp in the Budgerigar_v6.3, and 
1.7 Mbp to 7.3 Mbp in the PBcR assemblies, respectively 
(Table 2). Without limiting the length of the overhangs 
and gap sizes to 2 Mbp, the increase in N50 scaffold sizes 
in the Budgerigar_v6.3 is 17.1 Mbp (which we think could 
be an artifact). We speculate that some of the large gaps 
in the optical map correspond to centromeres or highly 
repetitive DNA that are difficult to assemble. 


Annotations 

The Budgerigar_v6.3 and PBcR assemblies were annotated 
at BGI for protein coding genes by first generating a refer- 
ence set of human, chicken and zebra finch proteins, and 
then aligning the reference set to the assemblies, and 
propagating annotations to 30% coverage of the reference 
at TBlastN, E=1e°. For the Budgerigar_v6.3 assembly, 
the reference set comprised of human proteins from 
Ensembl 60 and a set of zebra finch and chicken proteins 
re-annotated based on these human proteins, using a cus- 
tom BGI pipeline reported on separately (Jarvis et al. in 
preparation; Zhang et al., in preparation). For the PBcR 
assembly, the reference set comprised of the Ensembl 
60 human, chicken and zebra finch proteins. The 
propagation of these reference sets to the budgerigar 
assemblies is described in more detail in Additional 
file 1. Further, in the PBcR assembly, UTRs were an- 
notated for 6,203 genes using the GKOK2XFO1 tran- 
scriptome runs with a pipeline similar to the one 
described in [13]. The assembly annotations were then 
propagated to the corresponding sets of megascaffolds. 
No de novo gene annotations were performed. 

The annotated Budgerigar assemblies had fewer genes 
(15,470 and 16,204 genes in the Budgerigar_v6.3 and 
PBcR assemblies respectively) than the published Zebra 
Finch (18,618 genes) and Chicken genome assemblies 
(17,108 genes in the 2011 Galgal4 assembly [14]). We 
believe the lower number of annotated genes in budgeri- 
gar assemblies is due to the differences in annotation 
methods rather than assembly completeness, for two 
reasons: (1) These annotations were produced based on 
similarities to zebra finch, chicken and human proteins, 
and hence they cannot contain more genes than the 
source genome annotations; and (2) The independent 
GenScan annotation of the Budgerigar_v6.3 assembly at 
the UCSC Genome Browser contains more genes than 
in zebra finch and chicken, 24,095 in total. 
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Comparisons to other avian assemblies 

Our budgerigar genome assemblies were compared with 
the zebra finch, chicken, and falcon genomes [15-17]. 
The other assemblies from the Assemblathon 2 competi- 
tion are available from GigaDB [18]. The zebra finch and 
chicken had similar contig and scaffold N50 values 
(38.5 kb and 10.4 Mb for zebra finch, and 279.8 kb and 
90.2 Mb for chicken, respectively). In addition, since the 
Peregrine Falcon is the closest relative to parrots [19], 
we also compared the budgerigar genome assemblies to 
this bird. However, it was not possible to do an in depth 
comparison of these genomes to the recently sequenced 
Scarlet Macaw and Puerto Rican Parrot genomes [20,21], 
because both bird genomes had N50 scaffold sizes under 
20,000 and N50 contig sizes under 7,000. A summary of 
assemblies, including the Scarlet Macaw and Puerto Rican 
Parrot, are shown in Table 2. Apart from the standard 
genome assembly quality statistics, we assessed the quality 
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of the budgerigar assemblies along two other dimensions: 
(1) the coverage of highly conserved avian exons, and (2) 
the number of gaps 10 Kbp upstream and downstream of 
each gene (gene territories), and conversely, the number 
gene territories assembled without gaps. Of 3,288 highly 
conserved exons (>86% coverage across >87% of their 
length) we identified between chicken and zebra finch, 
3,165 (96.25%) and 3,134 (95.31%) were covered with >86% 
identity across >87% of their length in the Budgerigar_v6.3 
and PBcR assemblies respectively, pointing to good coverage 
of coding regions in these assemblies. The budgerigar as- 
semblies had fewer gaps within the coding sequences and 
gene territories than all other avian genomes examined, ex- 
cept the newer unpublished Galgal4 chicken assembly that 
is similar to the budgerigar in that it is a hybrid that includes 
both short and long sequences (Sanger and 454 FLX+) 
(Figure 2). This suggests that our budgerigar assemblies 
have very well assembled genes and promoter regions. 
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Figure 2 Number of nucleotide gaps assess relative assembly incompleteness. A) Shows the total number of gaps in genes and the 
surrounding 10,000 base pair regions upstream and downstream (collectively called gene territories). B) Shows the number of such gene 
territories with gaps. In both the panels, different species assemblies are colored differently, with the budgerigar assemblies shown in dark blue. 
The budgerigar assemblies with the “-mega” suffix are optical map enhanced versions of the Budgerigar_v6.3 and PBcR assemblies. The 
budgerigar assemblies have the highest numbers of gapless gene territories (right panel) and the fewest number of gaps of all assemblies except 
the recent chicken v4 assembly, which used a similar technology (left panel). 
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Using the online CoGe tool [22-24], we assessed the 
structural similarities between the various budgerigar as- 
semblies and other avian assemblies [25-30], by computing 
the level of coding sequence synteny among assemblies. 
The highest numbers of genes in synteny were observed, 
as expected, between a budgerigar assembly and the optical 
map assisted version of the same assembly (Figure 3A). 
However, the number of genes in synteny between the 
Budgerigar_v6.3 and the PBcR assemblies was similar to 
the number of genes in synteny between budgerigar and 
falcon (Figure 3A, B). Further, the number of genes in syn- 
teny did not strictly reflect phylogenetic relationships, as 
the zebra finch and budgerigar, close relatives [19], had a 
lower level of synteny than budgerigar and chicken. In 
addition, a number of inversions were observed even in the 
syntenic dotplots between the original budgerigar assem- 
blies and their optical map-assisted assemblies (88 inver- 
sions between Budgerigar_v6.3 and Budgerigar_v6.3_mega; 
209 inversions between PBcR and PBcR_mega, plots 
shown in GigaDB [2]). This suggests that synteny based on 
CoGE syntenic maps is affected by the quality of the as- 
semblies and the characteristics of the synteny algorithm. 
Thus, the number of genes in synteny computed using the 
available methods is only a rough measure of the actual 
structural similarity between the assemblies compared. 

In summary, this study shows that the budgerigar gen- 
omic resource we have generated has provided [5,6] (and is 
still expected to provide more) valuable data and material 
for genome technology development and for further inves- 
tigating complex behavioral traits at the genomics level. 
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All procedures on live animals were approved by the 
Institutional Animal Care and Use Committee of Duke 
University. 


Availability and requirements 

The genomic sequence reads have been deposited in NCBI's 
sequence read archives (SRA) and the EBI’s ENA archive, 
under the same project accession number ERP002324. The 
SOAPdenovo2 assembly has been submitted to GigaDB by 
the Assemblathon 2 team and is available at GigaDB [18]. 
Other supporting resources that have been deposited in 
GigaDB [2] are: 


Duke University brain transcriptome reads. 
Budgerigar_v6.3, PBcR assemblies (contigs and 
scaffolds) and optical map assisted megascaffolds 
based on these two assemblies (two contigs and four 
scaffolds in total). 

e The per base sequence quality distribution of the 
paired end and mate paired libraries. The estimated 
fragment length distribution of the mate paired 
libraries. Peptide and coding sequences (CDS) for 
the Budgerigar_v6.3 and PBcR assemblies. 

e Gene annotations and Repeat Masker annotations 
for the scaffolds. 

e Optical map alignments of Budgerigar_v6.3 and 
PBcR assemblies in Microsoft Excel and XML 
formats and software (Gnomspace.rar) to view the 
XML alignments. 

e The optical map dataset. 


Assembly groups 
budgerigar_v6.3 PBcR 
budgerigar” v6.3-mega ca PBcR-mega 
non-budgerigar 





11000 4 - 


10000 4 r 


9000 4 r 
8000 4 f | r 
0 oy 











~\ oo 
mm 
we? 


ee? ead 
Ne “ (oP oe & .) 
> a er e*s 
es: aor 
we ae” et a oe 
sor : 6 
Cer oe Sd 3) 





of values. 


Figure 3 The number of genes that are part of a syntenic block between different budgerigar assemblies (A) and between budgerigar 
and non-budgerigar assemblies (B). The numbers were calculated from CoGE syntenic dotplots (not shown), as the total number of genes 
represented in syntenic blocks. The y-axis limits have been cut off close to the minimum value in the plot to show a more detailed spread 
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Additional file 





| Additional file 1: Supplementary materials. 
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